We investigate the performance of some common machine learning techniques in identifying Blue Horizontal Branch (BHB) stars 
from photometric data. To train the machine learning algorithms, we use previously published spectroscopic identifications of BHB 
stars from Sloan Digital Sky Survey (SDSS) data. We investigate the performance of three different techniques, namely k nearest 
neighbour classification, kernel density estimation for discriminant analysis and a support vector machine (SVM). We discuss the 
performance of the methods in terms of both completeness (what fraction of input BHB stars are successfully returned as BHB stars) 
and contamination (what fraction of contaminating sources end up in the output BHB sample). We discuss the prospect of trading 
off these values, achieving lower contamination at the expense of lower completeness, by adjusting probability thresholds for the 
classification. We also discuss the role of prior probabilities in the classification performance, and we assess via simulations the 
reliability of the dataset used for training. Overall, applying no prior gives the best completeness, but adopting a prior lowers the 
contamination. We find that the support vector machine generally delivers the lowest contamination for a given level of completeness, 
and so is our method of choice. Finally, we classify a large sample of SDSS Data Release 7 (DR7) photometry using the SVM 
trained on the spectroscopic sample. We identify 27,074 probable BHB stars out of a sample of 294,652 stars. We derive photometric 
parallaxes and demonstrate that our results are reasonable by comparing to known distances for a selection of globular clusters. 
We attach our classifications, including probabilities, as an electronic table, so that they can be used either directly as a BHB star 
catalogue, or as priors to a spectroscopic or other classification method. We also provide our final models so that they can be directly 
applied to new data. 
1. Introduction 

The blue horizontal branch (BHB) stars are old, metal-poor halo 
stars. They are of interest as tracers of Galactic structure be- 
cause they are more luminous than most giant branch or pop- 
ulation II main sequence stars, have a narrow range of intrin- 
sic luminosities (hence 'horizontal branch') and display spec- 
tral features rendering them identifiable, in particular a strong 
Balmer jump and narrow strong Balmer lines. There is therefore 
an interest in building large, reliable samples of them, particularly
in the context of wide-field halo surveys such as the Sloan
Digital Sky Survey (SDSS) and the forthcoming Pan-STARRS survey.
BHB stars are always of interest whenever halo structure is
studied due to their strength as distance indicators. Recent studies
which have concentrated on BHB stars to trace structure include
|H^ ^2010), who searched for moving groups
in the halo, Xue et al. (2009), who used them to search for close
pairs, implying the existence of halo substructure, Kinman et al.
(2009), who searched for a population of BHB stars associated
with the thick disk, and Ruhland et al. (2010), who investigated
structure in the Sagittarius dwarf and streams. The main problem
with BHBs as tracers is their relative sparseness compared
to other tracers such as turnoff stars. This means that large, pure 
samples are highly desirable for structure tracing studies. 

In this paper, we take as our lead several recent studies of 
BHB spectra from SDSS/SEGUE and attempt to use the reliable 
and large samples of BHBs detected as a training set to build
models aimed at identifying BHBs from the photometry alone.
With this tool, we hope to be able to extend the available sample 
of known (or better, strongly suspected) BHB stars from SDSS 
and other surveys, with a view either to use our sample directly 
to trace structure, or at least to guide follow-up studies with spec- 
tra. 

The three main studies we follow are those of Yanny et al.
(2000), who identified a colour cut in the u - g, g - r colour-colour
diagram that yields most of the available BHB population,
Sirko et al. (2004), who used spectra to identify a reliable
sample of 700-1000 BHBs (the size of the sample depends on the
g magnitude and the reliability desired), and most importantly
Xue et al. (2008), who analysed a sample of SDSS DR6 data using
similar techniques to Sirko et al. and extended the reliable
list of BHBs to over 2,500 objects. The method of Xue et al. is
discussed in more detail in Sect. 3.3.

* Tables 7, A.3 and A.4 are only available in electronic form at the
CDS via anonymous ftp to cdsarc.u-strasbg.fr (130.79.128.5) or via
http://cdsweb.u-strasbg.fr/cgi-bin/qcat?J/A+A/

We have selected three machine learning methods to investigate.
These are a k-Nearest Neighbour (kNN) technique,
a kernel density estimator (KDE) and a support vector machine
(SVM). We also apply the decision boundary in (u - g, g - r)
colour space suggested by Yanny et al. (2000) for
comparison. After the colour cut, the kNN method is probably
the simplest algorithm we consider. One example of its
use can be found in Marengo & Sanchez (2009). Examples
of KDE use in classification problems in astrophysics include
Gao et al. (2008), Richards et al. (2009b), Richards et al.
(2009a) and Ruhland et al. (2010). The SVM works by identifying
a decision boundary in a multidimensional space (Vapnik
1995), in this case the space of SDSS colours, based on a training
set containing examples of two or more classes of object,
for our purposes BHB stars and non-BHB contaminants. The
SVM performance should be equivalent to that obtainable with
a neural network, but it has the advantage of being highly adaptable
and relatively easy to use. Its main drawback is its inability
to provide genuine probability estimates for classes, because
it does not model the distribution of the data. This is discussed
further in Sect. 4.4. SVMs have been used on classification
problems by various authors, for example Tsalmantza et al.
(2007, 2009), who developed a galaxy library for the Gaia mission
and explored classification problems therein, Gao et al.
(2008), who used them to search for quasars in SDSS data, and
Huertas-Company et al. (2009), who used them for morphological
galaxy classification. Bailer-Jones et al. (2008) discussed
SVM classification of astrophysical sources in the context of
unbalanced samples.

We proceed by taking the sample of Xue et al. and obtaining
the up-to-date photometry for it from SDSS DR7. We then
investigate the ability of each of the three techniques to recover 
the BHB stars from the Xue sample, and the various options that 
are available to optimize them. Finally, we take a new sample 
of DR7 photometry, for sources without spectra, and apply our 
models to this sample to recover samples of probable BHB stars. 
We use a selection of globular clusters of known distance to test 
the BHB classifications and photometric parallaxes. 
Fig. 1. The sample of Xue et al. (2008) in the u - g, g - r plane,
with DR7 data. BHB stars (according to Xue et al.) are shown as
red crosses, non-BHB stars as black points. The box shows the
colour cut used by Xue et al. (2008) and Yanny et al. (2000). The
outer boundary extends 0.1 magnitudes beyond this. Sources
outside the plot region were discarded.



2. Data 

The latest publicly available SDSS data release, DR7, covers ap- 
proximately 8400 square degrees, with images in the five SDSS 
bands: u,g,r,i,z. Spectra are available for a subset of the de- 
tected objects based on various selection criteria. 

The study of Xue et al. used a sample of SDSS DR6 data
selected to lie inside the colour box suggested by Yanny et al.
(2000) (0.8 < u - g < 1.6, -0.5 < g - r < 0.0). We have
recovered the sources used by Xue et al. in the DR7 release by
matching the SDSS MJD, plateId and fiberId fields. We obtained
the PSF magnitudes, estimated extinction, and the parameters
(Teff, log g, and [Fe/H]) as determined by the SDSS pipeline.
The dereddened magnitudes were obtained from the model magnitudes
during the pipeline processing by applying extinction
corrections derived from the map of Schlegel et al. (1998). We
recover the extinction from the model magnitudes and apply it
to the PSF magnitudes. The DR7 photometry is generally consistent
with the DR6 photometry given by Xue et al. to within
hundredths of a magnitude, but there are a number of sources
with more divergent values. We rejected the most discrepant of
these by introducing a colour cut 0.1 magnitudes outside of the
colour cut of Yanny et al. This cut excluded mostly contaminant
stars. The Xue et al. sample contained 10224 objects, of
which 2558 were identified by them as BHB stars. After rejecting
sources with discrepant photometry, 9929 objects remained,
of which 2536 were identified by Xue et al. as BHB stars.

We also cross matched against a list of 1172 objects from
the paper of Sirko et al. All these objects were identified by
Sirko et al. as BHB stars. Since Sirko et al. did not provide
SDSS identities in their table, we cross matched first with our
SDSS data on the basis of RA and Dec and g magnitude. After
cross matching, 1101 of the sources were identified in the SDSS
DR7 data. Of these, 4 had no identified counterpart amongst the
Xue et al. objects. Figure 1 shows the colour-colour diagram of
our sample.



3. General Approach 

All our classification methods are supervised, meaning that they 
require samples of data for objects of known type in order to 
train a model, which can then be applied to new data. Various 
parameters must be set to optimize the classification, and in the 
end the reliability of the methods relative to each other and in 
absolute terms has to be determined in some way. For these rea- 
sons, we need a sample of test objects of known type on which 
we can run our trained models. Our sample contains 2536 BHB
stars as identified by Xue et al. Our standard procedure was to
randomly split the BHB sources into roughly equal training and 
testing sets, and then randomly select equal numbers of non- 
BHB sources (designated "other") to include in the training and 
testing samples. We can investigate the statistical properties of 
the results by bootstrapping. 
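
The splitting procedure described above can be sketched as follows. This is only an illustration: the function name `balanced_split`, the use of integer identifiers and the seeds are assumptions made for the example, and repeating the split with different seeds is one simple way to obtain the resampled sets used to study the scatter of the results.

```python
import random


def balanced_split(bhb_ids, other_ids, seed=0):
    """Return (train, test) dicts with equal numbers of 'bhb' and 'other' ids."""
    rng = random.Random(seed)
    bhb = list(bhb_ids)
    other = list(other_ids)
    rng.shuffle(bhb)
    rng.shuffle(other)

    n_train = len(bhb) // 2        # roughly half of the BHB stars for training
    n_test = len(bhb) - n_train    # the remainder for testing
    if len(other) < n_train + n_test:
        raise ValueError("not enough non-BHB sources to balance the sets")

    train = {"bhb": bhb[:n_train], "other": other[:n_train]}
    test = {"bhb": bhb[n_train:], "other": other[n_train:n_train + n_test]}
    return train, test


# Hypothetical integer ids: 2536 BHB stars and 7393 other stars (9929 in total).
train, test = balanced_split(range(2536), range(2536, 9929), seed=1)
```

Drawing the "other" sources for the training and testing sets from disjoint slices of one shuffled list keeps the two sets independent.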

3.1. Data dimensionality and feature selection 

The PSF photometry was corrected for the expected extinction 
determined by the SDSS pipeline. There are four colours available.
The u - g and g - r colours are the most important for BHBs.
The others show little or no real information when examined by 
eye. They make some difference (for the better) for the kNN and 
SVM methods, but tend to degrade the KDE. It is possible that 
the improvement seen for kNN and SVM by including the other 
colours is mostly due to excluding faint sources due to their large 
scatter in all bands. 

3.2. Comparing methods: completeness and contamination 

The completeness is defined as the number of correctly classified 
sources of a particular class, divided by the number of available 
sources of that class, i.e. it is the fraction of test sources of a 
particular class that are correctly classified, 

$$\mathrm{completeness}_i = \frac{n_{i,j=i}}{N_i}, \qquad (1)$$
Fig. 2. Simulation of the effect of noise on the classification of Xue et al. The four spectral line parameters used in the classification are 
f_m, D_0.2, c_γ, and b_γ (see Xue et al. (2008) for details). Four magnitudes are illustrated; for each of these, we plot D_0.2 versus f_m 
and c_γ versus b_γ. The selection boxes are shown in each panel. The progressive loss of BHB stars (red crosses) from the selection 
box is clear. Non-BHB stars (black dots) are not scattered into the boxes at the same rate. 



where n_{i,j} is the number of objects of true class i classified as
output class j and N_i is the total number of input sources of class
i. Input sources can be lost from the output class due to misclassification
into another class, or by remaining unclassified due
to an insufficiently high classification confidence. The contamination
of the output sample is defined as the number of falsely
classified sources of that class divided by the number of sources
classified into that class, whether correctly or incorrectly,

$$\mathrm{contamination}_j = \frac{\sum_{i \neq j} n_{i,j}}{\sum_{i} n_{i,j}}. \qquad (2)$$

In our particular case, one class, the set of non-BHB stars, is 
really a mixed class of contaminants comprising blue stragglers 
and main sequence stars, that we are interested in removing in 
order to obtain a clean sample of BHB stars. Therefore, we are 
interested in the completeness and contamination of the BHB 
sample and not primarily in the completeness or contamination 
of the "other" class. 
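
As a small worked illustration of these two definitions, the sketch below evaluates them on a confusion matrix stored as nested dictionaries (rows are true classes, columns are output classes). The function names and the data layout are assumptions for this example; the counts are those of the colour-cut classification in Table 2, where unclassified sources are not split by true class and are therefore left out.

```python
def completeness(confusion, cls):
    """Eq. (1): correctly classified sources of `cls` over all input sources of `cls`."""
    row = confusion[cls]
    return row[cls] / sum(row.values())


def contamination(confusion, cls):
    """Eq. (2): sources wrongly assigned to `cls` over all sources assigned to `cls`."""
    assigned = sum(row.get(cls, 0) for row in confusion.values())
    wrong = assigned - confusion[cls].get(cls, 0)
    return wrong / assigned


# Rows are true classes, columns are output classes (counts from Table 2).
table2 = {
    "bhb": {"bhb": 1963, "other": 486},
    "other": {"bhb": 2560, "other": 3321},
}
print(round(completeness(table2, "bhb"), 3))   # 0.802
print(round(contamination(table2, "bhb"), 3))  # 0.566
```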

3.3. Reliability of the training and testing sets 

Since our method is based on the results of Xue et al., it is worthwhile
investigating how reliable these are, particularly at the
faint end. To this end, we selected 1381 spectra from the
original data of Xue et al., having 14.5 < g < 15.5. Roughly half of these
(655) were BHB stars. We then added artificial noise to degrade
them to the same signal-to-noise ratio as fainter spectra. We constructed
in this way eight artificial samples in half magnitude
steps from g = 16.0 to g = 19.5. We then reanalysed these degraded
spectra with the technique of Xue et al. and reclassified them.
We then compare the performance at faint magnitudes with the
original performance.
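
The text does not spell out how the artificial noise was generated. One simple possibility, shown here purely as a sketch, assumes independent Gaussian noise and a single signal-to-noise ratio per spectrum: if a pixel has noise flux/snr_in and should end up with flux/snr_out, the extra noise to add has the quadrature difference of the two as its standard deviation. The function names and the example numbers are hypothetical.

```python
import math
import random


def extra_noise_sigma(flux, snr_in, snr_out):
    """Standard deviation of the Gaussian noise to add to one pixel."""
    if snr_out >= snr_in:
        raise ValueError("target S/N must be lower than the input S/N")
    # sigma_out^2 = sigma_in^2 + sigma_added^2, with sigma = |flux| / snr
    return abs(flux) * math.sqrt(1.0 / snr_out**2 - 1.0 / snr_in**2)


def degrade_spectrum(fluxes, snr_in, snr_out, seed=0):
    """Return a noisier copy of `fluxes` with approximately the target S/N."""
    rng = random.Random(seed)
    return [f + rng.gauss(0.0, extra_noise_sigma(f, snr_in, snr_out)) for f in fluxes]


# Hypothetical example: a flat spectrum degraded from S/N = 50 to S/N = 10.
noisy = degrade_spectrum([100.0] * 2000, snr_in=50.0, snr_out=10.0, seed=3)
```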

The classification of Xue et al. is based on recovering four
different characteristic parameters from the absorption lines.
These are D_0.2, the width at 20% below the continuum of the
Balmer line, f_m, the flux relative to the continuum at the line
core, and c_γ and b_γ, which are parameters from a Sersic fit to
the line shape (Sersic 1968). BHB stars are identified as lying
within selection boxes in the feature space formed by the line
parameters. This selection is illustrated for our degraded data
in Fig. 2, which shows for four example magnitudes the D_0.2
versus f_m and c_γ versus b_γ values, and the selection boxes. It
is clear from the figure that, as the noise increases, true BHB
stars scatter outside one or the other selection box and are lost,
decreasing the completeness. Some sources from outside the selection
boxes scatter into the box, but since the box covers a
small fraction of the data space, and since a contaminant has to
scatter into both boxes to be misclassified as a BHB, the increase
in the absolute number of contaminating sources is modest. The
contamination, as defined above, will still increase because the
number of true positives, in the denominator of Equation 2, is
decreasing. Figure 3 shows the resulting ratio of objects classified
(rightly or wrongly) as BHB stars to total stars as a function
of g. This is discussed in terms of the effect on the prior probability
in the following section. For now, we note that this effect
becomes strong for sources fainter than g = 19.0, and that it will
degrade the quality of training sets used to define models and
also of any testing set used to assess them.

3.4. Priors 

The classifiers we use are trained on mixed samples of BHB and 
non-BHB stars with a range of properties. We implicitly assume 
that the classifier takes account of the likely distribution of the 
population of objects to be classified, and if it does not, we need
to correct the classifier output probabilities using appropriate priors.

The simplest prior to be accounted for is the true class fraction.
To train our models, we use equal numbers of BHB and
non-BHB stars, whereas the true class fractions are not equal
(the fraction of BHB stars in the sample of Xue et al. is approximately
0.26 for all sources). For the kNN and KDE methods we
could directly include the class fractions in the training sets. For
the SVM we could also include proportional fractions of classes,
but the actual effect of this on the classifier is complex and not
well understood. We choose to always use equal class fractions
and use a prior to adjust the classifier output. We refer to this
type of class fraction prior as a 'simple prior' hereafter.
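
A minimal sketch of such an adjustment for a two-class problem is given below. It assumes that the classifier output behaves like a posterior probability obtained with the training class fraction (0.5 here) as its implicit prior, so that each class probability is rescaled by the ratio of the desired prior to the training fraction and the result is renormalized. The function name is hypothetical and this is one plausible form of the correction, not necessarily the exact implementation used for the results in this paper.

```python
def apply_class_prior(p_bhb, prior_bhb, train_fraction=0.5):
    """Rescale a two-class output probability from the training fraction to a new prior."""
    w_bhb = p_bhb * prior_bhb / train_fraction
    w_other = (1.0 - p_bhb) * (1.0 - prior_bhb) / (1.0 - train_fraction)
    return w_bhb / (w_bhb + w_other)


# With a BHB class fraction of 0.32, an output of 0.68 from the balanced
# classifier is pulled down to about 0.5, so a 0.5 threshold on the adjusted
# probability corresponds to roughly 0.68 on the raw output.
adjusted = apply_class_prior(0.68, 0.32)
```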

We can also adjust for prior probabilities as a function of
other parameters that are not accounted for by the classifier itself.
We consider g magnitude and b, the Galactic latitude. In
Fig. 4 we show the ratio of the density functions of BHB stars
and all stars as functions of g in the sample of Xue et al. (dashed
green line). Because these density functions were individually
normalized, the resulting ratio is the relative fraction of BHB
stars, rather than the absolute fraction, i.e. it is as if the class
fractions were equal.

Also shown in this plot is the ratio of BHB stars to all sources
at each magnitude as estimated from the experiment described
above in Sect. 3.3 (red dash-dotted line). This curve, which is a
renormalized version of the curve shown in Fig. 3, shows what
we would expect to see if the fraction of BHB stars in the true
population was constant with g, but the observed ratio was altered
by sources being lost due to increasing noise at high magnitudes.
We can correct the measured ratio of BHBs to all sources
for this effect. This correction mitigates the falloff of the ratio
at the faint end. The corrected curve is shown as the dotted blue
line. We use this as the basis of the magnitude dependent prior,
but the correction causes a spike at the faint end that is probably
due to small numbers of sources and is obviously not desirable
in the prior. For this reason we have truncated the function before
it turns up and adopted a plateau for the high magnitude
end. Similarly, we adopt a plateau at the bright end, where the
fraction may be strongly affected by the SDSS spectrum selection
function (there is a cutoff at g < 14 for the Legacy spectra,
Adelman-McCarthy et al. 2006). The adopted prior as a function
of g is plotted as the solid black line in Fig. 4.

Figure 5 shows the ratio of density functions for BHB and 
non-BHB sources as a function of absolute Galactic latitude. 
This ratio shows a relatively smooth trend, with quite a lot of 
structure superimposed. We model it with a straight line fit and 
adopt this fit as the relative prior in latitude. 
Fig. 3. The effect of increasing noise on the spectroscopic classification
of Xue et al., as illustrated in Fig. 2. The line shows
the ratio of the number of output BHB stars, that is, the sources
classified as BHB stars, regardless of whether they really are or
not, to total stars.
Fig. 4. The dashed green line shows the ratio of the density function
of BHB stars to the density function of all stars. The dot-dashed
red line shows the ratio of BHB stars to non-BHB stars
from the classification described in Sect. 3.3. This is the same
curve plotted in Fig. 3, but it has been renormalized so that the
peak is at the same level as the peak of the ratio of density functions,
i.e. so that it doesn't include the simple prior. The dotted
blue curve is the result of correcting the basic BHB fraction
(dashed green line) for the expected change in the ratio due to
noisy spectra. The regions at either end are replaced with a constant
value, and the prior actually adopted is plotted as the thick
solid black line.
Fig. 5. The solid green line shows the ratio of BHB star density 
to the sum of BHB density and other stars density as a function 
of galactic latitude b. The dashed black line is a linear fit used to 
build the 2-dimensional prior as described in the text. 
Table 1. Coefficients for decision boundary in u - g, g - r colour space.

If these priors are independent of one another, we can apply
them in sequence to the output posterior probability of the
classifier using

$$P(C|D_1, D_2, \ldots, D_N) = a \, \frac{\prod_{n=1}^{N} P(C|D_n)}{P(C)^{N-1}}, \qquad (3)$$

where P(C|D_n) is the probability of class membership given
some piece of information, D_n, P(C) is the prior probability of
the class, and a is a normalization constant. This formula is discussed in
depth in Bailer-Jones & Smith (2010). The issue of class fraction
priors and its influence on classifier training is discussed
in Bailer-Jones et al. (2008). The correlation of g and b is low,
with a Pearson coefficient of -0.0017. The assumption of independence
therefore holds well.
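
For a two-class problem, combining several independent pieces of information in this way can be sketched as below: the product of the individual posteriors is divided by the class prior raised to the power N - 1, and the result is normalized over the two classes. The function name and the example values are assumptions made for illustration only.

```python
from math import prod


def combine_posteriors(p_list, prior):
    """Combine independent posteriors P(C | D_n) for one class of a two-class problem.

    `p_list` holds P(C | D_n) for each piece of information D_n and `prior`
    is the class prior P(C); the complementary class uses 1 - p and 1 - prior.
    """
    n = len(p_list)
    w_c = prod(p_list) / prior ** (n - 1)
    w_not = prod(1.0 - p for p in p_list) / (1.0 - prior) ** (n - 1)
    return w_c / (w_c + w_not)  # normalize so the two classes sum to one


# Hypothetical values: classifier output 0.6, magnitude-based estimate 0.45,
# latitude-based estimate 0.30, all relative to a class prior of 0.32.
p = combine_posteriors([0.6, 0.45, 0.30], prior=0.32)
```

A piece of information whose posterior equals the prior leaves the combined probability unchanged, which is a quick sanity check on the formula.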

The ratio of BHB stars to all stars for all data is 0.26; however,
this includes regions where the selection function for SDSS
spectra has a large effect (there is a cutoff at g = 14 for SDSS
Legacy spectra and g, r or i = 15 for SEGUE, and at g > 19 the
reliability of the Xue et al. classification method becomes difficult
to assess). The ratio of BHB stars to all stars in the interval
14 < g < 19 is 0.32. We use this latter fraction as the class
fraction.

For the analysis of the different classifiers, we consider two 
different priors, a simple ratio (equal to 0.32) that represents the 
fraction of BHB stars to all stars over the sample in the interval 
14 < g < 19, and the combination of this with the priors as 
functions of g or b as discussed above. We refer to the first prior 
as a 'simple prior' and to the second as a '2d prior', because it is 
a function of the two variables g and b. 

3.5. Priors and performance measures 

The issue of the class fractions enters the analysis in two distinct 
ways, and it is worth discussing this issue explicitly because it 
can easily lead to confusion. 

Firstly, the class fractions are the single most important con- 
tribution to the prior probability used to obtain posterior proba- 
bilities for each object. This issue is reasonably clear. 

Secondly, as well as adjusting the classifier probabilities with 
the prior, when testing the classifiers, we also have to take ac- 
count of the uneven expected class fractions in the measured 
contamination (and in any other quantity where they would be 
important - the completeness is not affected, as can be seen from 
Equation 1). We can do this by either using a test set that reflects 
the true expected fractions (which would be possible in our case 
because the classes are not extremely unbalanced), or by cor- 
recting the contamination for the difference between the input 
test set fractions and the expected true fractions. Since the pop- 
ulation composition as a function of g and b is already present 
in the test set, no correction should be made to the test output 
to correct for the relative fractions of BHB stars as a function of 
these quantities. 

Note that this reweighting of the contaminants in the output 
sample has to be carried out anyway if the expected class frac- 
tions are different to the fractions in the test set, whether or not 
we also apply the prior to the classifier probabilities. All the contaminations
presented in this paper, except that in Table 2, are
based on test sets with equal class fractions, and are corrected 
to the expected class fractions after the classification using the 
estimated fraction of 32% BHB stars. 
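
One natural way to carry out this correction, given here as a sketch rather than as the exact procedure behind the tabulated values, is to weight the per-class rates measured on the balanced test set by the expected class fractions. The function name is hypothetical; the example rates are the row percentages of the equal-priors section of Table 3.

```python
def corrected_contamination(rate_other_as_bhb, rate_bhb_as_bhb, f_bhb):
    """Contamination of the BHB output sample for a true BHB fraction `f_bhb`.

    `rate_other_as_bhb` is the fraction of true non-BHB test sources classified
    as BHB, and `rate_bhb_as_bhb` is the fraction of true BHB test sources
    classified as BHB (i.e. the completeness).
    """
    false_pos = (1.0 - f_bhb) * rate_other_as_bhb
    true_pos = f_bhb * rate_bhb_as_bhb
    return false_pos / (false_pos + true_pos)


# kNN with equal priors: 28.49% of OTHER and 81.09% of BHB end up as bhb.
balanced = corrected_contamination(0.2849, 0.8109, 0.5)   # about 0.26
expected = corrected_contamination(0.2849, 0.8109, 0.32)  # about 0.43
```

With a BHB fraction of 0.32 the result is close to the contamination of 0.422 listed for this case in Table 3, whereas the uncorrected value for a balanced test set would be much lower.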

The issue of priors in the context of a classification problem
with a highly unbalanced data set was addressed in some
depth by Bailer-Jones et al. (2008). In particular, Sect. 2.5.1 of
that paper discusses the issue of correction of the contamination
in more depth than is possible here. There the specific problem
was identifying quasars amongst stellar samples, which is an extremely
unbalanced problem. The issue with BHB stars is less
severe.

Table 1.

Segment                     m         c
-0.3 < g - r <= -0.25       2.4       1.62
-0.25 < g - r <= -0.15      1.6       1.42
-0.15 < g - r <= 0.         -0.533    1.1

Table 2. Results of classification with decision boundary.

Absolute
                bhb       other
BHB             1963      486
OTHER           2560      3321
Unclassified    436

Percentage
                bhb       other
BHB             80.16     19.84
OTHER           43.53     56.47

                Completeness    Contamination
BHB             0.802           0.566



4. Comparative performance of machine learning 
techniques 

4.1. Colour box and direct decision boundary 

Yanny et al. (2000) derived a decision boundary in u - g, g - r
colour space to distinguish low gravity BHB giants from contaminating
main sequence (MS) and blue straggler (BS) stars in the colour box. Their decision
boundary consists of three straight line segments and is shown
in their Fig. 10. Our estimates of the gradients and intercepts of
the line segments are shown in Table 1. Sources are BHB stars
if they have u - g < m(g - r) + c, with values of m and c taken
from the Table.
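
The rule above can be written as a short function. The sketch below uses the segment limits and the m and c values listed in Table 1; the function name and the use of `None` for sources outside the tabulated g - r ranges are choices made for this example.

```python
# (lower g-r limit, upper g-r limit, gradient m, intercept c) from Table 1
SEGMENTS = [
    (-0.30, -0.25, 2.4, 1.62),
    (-0.25, -0.15, 1.6, 1.42),
    (-0.15, 0.00, -0.533, 1.1),
]


def classify_colour_cut(u_g, g_r):
    """Return 'bhb', 'other', or None if g - r is outside the tabulated segments."""
    for lower, upper, m, c in SEGMENTS:
        if lower < g_r <= upper:
            return "bhb" if u_g < m * g_r + c else "other"
    return None  # source remains unclassified


print(classify_colour_cut(1.0, -0.2))   # bhb   (boundary at u - g = 1.10)
print(classify_colour_cut(1.3, -0.2))   # other
print(classify_colour_cut(1.0, -0.4))   # None
```

The tabulated segments join continuously: both the first and second segments give u - g = 1.02 at g - r = -0.25, and the second and third give about 1.18 at g - r = -0.15.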

We classified our test set with this boundary, and obtained
the results summarized in Table 2. All sources with g < 19 were
classified, since no training set is needed. The test set class fractions
are not artificially balanced, so no prior has been applied.
Sources lying outside the ranges of Table 1 remain unclassified.
The results are presented first in the form of a confusion matrix.
Each row of the matrix corresponds to a particular true class, either
BHB or other. The rows are labeled in capitals to indicate
that this is the true class of the object. The columns list the output
classifications. The leading diagonal of the matrix therefore
shows the correct classifications. The off diagonal elements indicate
misclassifications, and it is possible to see which classes are particularly
confused with one another. The confusion matrix is presented
twice, once with the absolute numbers of objects in each
classification bin, and once with the classifications expressed as
percentages of the total number of input objects of that class.
The rows of this second matrix therefore sum to 100%. We also present
the completeness and contamination obtained with this method.

4.2. k-Nearest Neighbours 

Nearest neighbour techniques are probably the simplest and 
most intuitively obvious method for supervised classification. 
For a given new object, we select the k nearest training points 
in the data space and assign a class based on the classes of the 
neighbours. For k > 1 we could choose to select a simple majority
of objects, or we could impose a higher threshold in an attempt
to improve the purity of one or both of the output classes
(i.e. BHB stars or other). Introducing a threshold implies that we
must be prepared to tolerate non-classifications.



Table 3. Results for kNN classification.

With equal priors

Absolute
          bhb       other
BHB       1008      235
OTHER     347       871

Percentage
          bhb       other
BHB       81.09     18.91
OTHER     28.49     71.51

          Completeness    Contamination
BHB       0.811           0.422

With simple prior

Absolute
          bhb       other
BHB       743       500
OTHER     245       973

Percentage
          bhb       other
BHB       59.77     40.23
OTHER     20.11     79.89

          Completeness    Contamination
BHB       0.598           0.412

With 2d prior

Absolute
          bhb       other
BHB       920       323
OTHER     264       954

Percentage
          bhb       other
BHB       74.01     25.99
OTHER     21.67     78.33

          Completeness    Contamination
BHB       0.740           0.378

Notes. Top section: Confusion matrix showing the results of kNN classification
(k=15) with threshold P(BHB) > 0.5. This is the combined
result of ten independent classifications with resampling of the training
and testing sets. The rows show the true class (according to Xue et al.),
the columns show the classifier output class. When shown as percentages,
the quantities in the rows should add to 100%, but the quantities in
the columns in general do not. The completeness and contamination are
also shown, and the contamination is corrected for the class imbalance
using the simple prior. Middle section: The same confusion matrix, with
output probabilities corrected for the simple prior. Bottom section: The
results with output probabilities corrected for the prior probability as
function of g magnitude and galactic latitude.

Fig. 6. The effect on the completeness (green squares) and contamination
(red triangles) of varying the minimum probability
(including the effect of the 2d prior) required for a positive classification
in the kNN method. The completeness (necessarily)
falls as the threshold is increased. [Plot: horizontal axis min(P(BHB)), from 0.5 to 0.9.]

A probability can be estimated from the fraction of neighbours
belonging to each class, so for example if nine out of
ten of the nearest neighbours are BHB stars, we would estimate
P(BHB) = 0.9.

We ran the kNN technique for various choices of k and measured
the output sample completeness and contamination. The
classifier was run with ten resamplings of the training and test
sets for each k, from k = 1 up to k = 100, and the classification
was performed with simple majority voting. The completeness
was found to be approximately constant with increasing k, but
the contamination showed a shallow minimum at around k = 15,
which we selected as the optimum value.

We also experimented by cutting the colours used from four
down to two (u - g and g - r). The result was a slight degrading
of the results for all values of k. We therefore use the kNN
technique with the four dereddened colours.

We next investigated the effect of varying the confidence
threshold for classification and measuring the completeness and
contamination of the output BHB star sample. Increasing the
threshold would be expected to lead to a loss of completeness,
but also a lowering of the contamination. The results of this for
kNN are shown in Fig. 6. The completeness does indeed fall, but
the contamination remains constant at around 0.4.

We performed ten classifications, with resampled training
and testing sets, with the kNN method to get a final estimate
of performance. The value of k and the probability threshold
were left fixed (k = 15, threshold = 0.5). The results are shown in
Table 3. This table is divided into three sections. In the top section,
we present the results of applying the classifier to the test
data without applying any prior. This is equivalent to assuming
equal true class fractions. The second section presents the results
with the application of the so-called simple prior, with which we
correct for the effect of the class fractions only. The final section
presents the results with the application of the 2d prior, a
function of g and b. In each section we present the results as confusion
matrices of absolute classifications and as percentages of
the input true classes. Finally we present the completeness and
contamination.
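
The neighbour-fraction probability and the confidence threshold used in this section can be sketched in a few lines. The example assumes a plain Euclidean distance in colour space and a brute-force neighbour search; the function names, the label strings and the toy data in the usage comment are illustrative only.

```python
import math


def knn_p_bhb(point, train_points, train_labels, k=15):
    """Estimate P(BHB) as the fraction of the k nearest training points labelled 'bhb'."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(point, train_points[i]),
    )
    nearest = order[:k]
    return sum(train_labels[i] == "bhb" for i in nearest) / len(nearest)


def classify(p_bhb, threshold=0.5):
    """Return 'bhb' or 'other' if one class exceeds the threshold, else None."""
    if p_bhb > threshold:
        return "bhb"
    if 1.0 - p_bhb > threshold:
        return "other"
    return None  # source remains unclassified
```

For example, with k = 10 and nine BHB neighbours the estimate is 0.9, which is accepted at a threshold of 0.7, whereas an estimate of 0.6 would leave the source unclassified at that threshold.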



Figure 7 shows the completeness and contamination for test
samples classified with the kNN method, with the results binned
by magnitude. The threshold for classification is always 0.5, and
k = 15. This is the average of 100 separate trials. The lower plot
shows the standard deviation in each bin.

As expected, the classifier performance falls off for fainter
magnitudes. Part of this effect may be due to the natural confusion
in the test set between BHB stars and non-BHB stars,
introduced by the noise in the method of Xue et al. (2008).
Fig. 7. Top: Completeness (green squares) and contamination 
(red triangles) for sources of different magnitudes classified with 
the kNN technique. One hundred separate trials were averaged 
to produce this plot. The filled symbols and solid lines show the 
results using the 2d prior. The open symbols and dashed lines 
show the results using the simple prior only. Bottom: Standard 
deviations of completeness and contamination at each point. 

4.3. Kernel density estimation for classification 

We next consider a kernel density estimation (KDE) approach
to the classification. The density estimate is a weighted mean
of neighbours, the weighting function being a kernel of choice.
See Hastie et al. (2001) for a general discussion of the method,
and see Richards et al. (2009a,b) for examples of KDE used to
identify quasars in SDSS data.

We use an Epanechnikov kernel, which is truncated and
so is less influenced by distant points. In practice, the choice
of kernel is usually less important than the bandwidth value.
The bandwidth was set independently for each dimension. The
package np in R* was used to implement the KDE method
(Hayfield & Racine 2008), and also to determine the optimal
value for the bandwidth, using the method of Li & Racine
(2003). This is based on leave-one-out cross-validation and
involves minimizing the variance amongst trial density functions
constructed with different bandwidth values.

Trial and error experimentation with the available colours 
shows that reasonable results can be obtained with u - g and g - r,
but the addition of further colours degrades the performance. We 



construct density functions for both the BHB stars and the
non-BHB stars and compare the values at the locations of test data
or new data points to classify the source. The individual density
functions for BHB and non-BHB sources in the training set are
shown as contours in u - g, g - r space in Fig. 8. The probability of
an object being of class cl from a number n_c of possible classes
is taken to be



$$P(cl) = \frac{K_{j=cl}(x)}{\sum_{j=1}^{n_c} K_j(x)} \qquad (4)$$

where K_j are the density functions for each class and x are the
data.
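
As an illustration of Eq. (4), the following sketch evaluates a product Epanechnikov kernel density for each class and normalizes the densities into class probabilities. It is not the np implementation used in the paper: the function names, the toy colour values and the bandwidths are hypothetical, and the same bandwidths are assumed for both classes for simplicity.

```python
def epanechnikov(u):
    """Truncated quadratic kernel; zero outside |u| < 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kde_density(x, sample, bandwidths):
    """Product-kernel density estimate at x with one bandwidth per dimension."""
    total = 0.0
    for point in sample:
        weight = 1.0
        for xi, pi, h in zip(x, point, bandwidths):
            weight *= epanechnikov((xi - pi) / h) / h
        total += weight
    return total / len(sample)


def class_probabilities(x, samples_by_class, bandwidths):
    """Eq. (4): density of each class divided by the sum over classes."""
    dens = {c: kde_density(x, s, bandwidths) for c, s in samples_by_class.items()}
    norm = sum(dens.values())
    if norm == 0.0:
        return None  # x lies outside the support of every kernel
    return {c: d / norm for c, d in dens.items()}


# Toy (u - g, g - r) training points; values are illustrative only.
training = {
    "BHB": [(1.15, -0.20), (1.20, -0.15)],
    "OTHER": [(1.00, -0.10), (0.95, -0.05)],
}
probs = class_probabilities((1.18, -0.18), training, bandwidths=(0.2, 0.2))
print(probs)
```

Because the kernel is truncated, a point far from all training data has zero density in every class, and the sketch returns None in that case rather than dividing by zero.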

Training and test sets were independently selected ten times
and used to train and test a model. We applied the KDE classifier
to the test set under the assumptions of equal class sizes (flat
prior), the simple prior (P(BHB) = 0.32 for all sources), and the
2d prior. The results for all these tests are shown in Table 4. The
layout of this table is the same as Table 3.



As with the kNN method, we experimented with thresh- 
olds at different levels of classification confidence. We adjust 
the threshold probability for BHB classification and record the 
resulting output sample completeness and the contamination. 
These are shown in Fig. 9. As expected, the effect of introducing
a threshold higher than 0.5 for classification is to reduce both the 
completeness and contamination. 
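
The threshold experiment can be sketched as a simple sweep over candidate thresholds. The function name and the toy probabilities below are hypothetical, and the contamination computed here is the raw value for the test sample, without the correction for class fractions applied in the tables.

```python
def sweep_thresholds(probabilities, is_bhb, thresholds):
    """Return (threshold, completeness, contamination) for each threshold.

    Assumes the sample contains at least one true BHB star.
    """
    n_true = sum(is_bhb)
    rows = []
    for t in thresholds:
        selected = [truth for p, truth in zip(probabilities, is_bhb) if p > t]
        hits = sum(selected)
        completeness = hits / n_true
        if selected:
            contamination = (len(selected) - hits) / len(selected)
        else:
            contamination = float("nan")  # empty output sample
        rows.append((t, completeness, contamination))
    return rows


# Toy posterior probabilities and true classes, for illustration only.
p_bhb = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45]
truth = [True, True, False, True, False, True]
for row in sweep_thresholds(p_bhb, truth, [0.5, 0.7, 0.9]):
    print(row)
```

In this toy sample both quantities fall as the threshold rises, which is the expected behaviour described above.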




* http://www.r-project.org 



[Plot: completeness and contamination versus min[P(BHB)]]

Fig. 9. Effect of varying the probability threshold with the KDE
method for classification as a BHB star on the completeness
(green squares) and contamination (red triangles) of the output.
The results shown in Table 4 correspond to a probability threshold
of 0.5. Output probabilities have been modified by the 2d
prior.



Figure 10 shows the completeness and contamination for the
test sample classified with the KDE method, with the results 
binned by magnitude. The threshold for classification is always 
0.5. 



4.4. Support vector classification 

Support vector classification is a supervised method in which a 
high dimensional decision boundary is fit between two classes. 
Fig. 8. The density of points for the non-BHB star training set (left) and the BHB star training set (right) in the u - g, g - r plane.
These density functions are used for the KDE classification. Contours range from 2 to 22 stars per unit area in steps of 2.

Fig. 10. Top: Completeness (green squares) and contamination
(red triangles) for sources of different magnitudes classified with
the KDE technique with one hundred trials. The filled symbols
and solid lines show the results using the 2d prior. The open
symbols and dashed lines show the results using the simple prior
only. Bottom: Standard deviation of completeness and contamination
over one hundred trials.



The boundary is chosen to maximize the margins with the
nearest representative points of each class (the so-called support
vectors). See Vapnik (1995) for a fuller description. A linear
SVM defines a boundary that is linear in the original data space
(in our case the four SDSS colours). By using a kernel function,
a higher dimensional feature space can be defined, and the
decision boundary instead defined in this. We use the second
order radial basis function as a kernel here. This function has
a single parameter, gamma, which must be set before training
the model. To deal with the problem of regularization for
noisy data, a cost parameter can be introduced that acts to
soften the margin. The cost parameter is so called because it
controls the extent to which the algorithm will attempt to fit a
more complex boundary in order to correctly classify all of the
training points, i.e. it is the 'cost' to the algorithm of misfitting
training points during the model training (Cortes & Vapnik
1995). We use the libSVM implementation, which is available
online (http://www.csie.ntu.edu.tw/~cjlin/libsvm/,
Chang & Lin 2001) and is implemented in the R package
e1071.



4.4.1. Probabilities from SVM 

The SVM method is not designed to provide probabilities, since 
it deliberately discards many of the training points, using only 
the support vectors to build the model of the decision boundary. 
However, a probability estimate can be made based on the distance
of a test point from the decision boundary (Platt 1999).
The actual probability returned is based on a model fitted to the 
training data. This probability estimate is essential if we want to 
trade off completeness versus contamination, or use priors. 
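
A minimal sketch of this kind of probability model follows. It assumes the commonly used sigmoid form in which the signed distance from the decision boundary is mapped to a probability with two fitted parameters; the parameter values shown are hypothetical, and in practice they would be fitted to the training data by the SVM library.

```python
import math


def platt_probability(decision_value, a, b):
    """Map a signed distance from the boundary to a probability.

    Uses P = 1 / (1 + exp(a * f + b)), evaluated in a numerically
    stable way. The parameters a and b are assumed to be already fitted.
    """
    z = a * decision_value + b
    if z >= 0.0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


# Hypothetical parameters: a < 0 so that larger distances on the
# positive side of the boundary give probabilities closer to one.
for f in (-2.0, 0.0, 2.0):
    print(f, platt_probability(f, a=-2.0, b=0.0))
```

A point on the boundary receives P = 0.5 when b = 0, and the probability saturates for points far from the boundary, which is why sources far outside the training locus can be assigned misleadingly confident values.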

The training data were standardized colour by colour so that 
each of the colours had zero mean and unit standard deviation. 
The same offset and scaling, calculated from the training data, 
were applied to the testing data. The SVM was run over a grid
of values of the two parameters, cost and gamma, with a fourfold
cross-validation using the training data to determine the best choice
for these values. The optimum values chosen were gamma=0.25, cost=64.
The model was then trained on the training set and applied to the
test set. The basic classification performance is shown in Table 5.
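
The standardization step can be sketched as follows. The helper names and colour values are hypothetical, and the population standard deviation is assumed; the point of the sketch is only that the offset and scale are computed once from the training data and then reused unchanged for the test data.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Per-colour mean and standard deviation from the training set only."""
    columns = list(zip(*train_rows))
    return [(mean(c), pstdev(c)) for c in columns]


def apply_scaler(rows, scaler):
    """Apply the training offset and scaling to any set of rows."""
    return [tuple((v - m) / s for v, (m, s) in zip(row, scaler)) for row in rows]


# Toy (u - g, g - r) colours, for illustration only.
train = [(1.0, 0.0), (1.2, -0.2), (1.4, -0.4)]
test = [(1.3, -0.1)]
scaler = fit_scaler(train)
train_std = apply_scaler(train, scaler)
test_std = apply_scaler(test, scaler)
print(train_std)
print(test_std)
```

Reusing the training scaler on the test set keeps the two sets in the same coordinate system, so that a value of gamma chosen by cross-validation on the training data remains meaningful for the test data.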
Table 4. Results for KDE classification.

Equal priors
  Absolute    bhb      other
  BHB         1021     222
  OTHER       505      713
  Percent     bhb      other
  BHB         82.1%    17.9%
  OTHER       41.5%    58.5%
              Completeness   Contamination
  BHB         0.821          0.512

With simple prior
  Absolute    bhb      other
  BHB         765      478
  OTHER       281      937
  Percent     bhb      other
  BHB         61.5%    38.5%
  OTHER       23.1%    76.9%
              Completeness   Contamination
  BHB         0.615          0.438

With 2d prior
  Absolute    bhb      other
  BHB         912      331
  OTHER       293      925
  Percent     bhb      other
  BHB         73.4%    26.6%
  OTHER       24.1%    75.9%
              Completeness   Contamination
  BHB         0.734          0.405



Notes. Top section: Confusion matrix showing results of KDE clas- 
sification of 10 independently selected test sets based on 10 indepen- 
dently trained models. The classification threshold is P(BHB) > 0.5 in 
each case. The results are shown at the top as mean numbers of sources 
in each category and then as percentages. The completeness and
contamination for the BHB output sample are also shown. These are
corrected for the expected unbalanced class fractions. Middle section: The
same quantities calculated with the simple prior applied to the classifier 
output probabilities. Bottom section: The confusion matrix, BHB com- 
pleteness and BHB contamination obtained when the prior as a function 
of g and latitude is applied to the classifier output probabilities. 



We consider the effect of a threshold on the measured completeness
and contamination. The results of introducing various
thresholds greater than P(BHB) = 0.5 are shown in Fig. 11. It
can be seen from Fig. 11 that the completeness and contamination
both fall as the threshold is increased, except for very high
thresholds when the contamination in fact rises. This is possible
if the set of sources with the highest values of P(BHB) contains a
large number of contaminants. This is undesirable, but is partly
caused by the low number of sources with high P(BHB): in fact
there are only thirteen sources with P(BHB) > 0.9.

Figure 12 shows the completeness and contamination for the
test sample classified with SVM, with the results binned by magnitude.
This plot shows the results both with the simple prior
and the 2d prior. As with the KDE method, the SVM performs
well for 14 < g < 18 but progressively more poorly for fainter
sources.



[Plot: completeness and contamination versus min[P(BHB)], from 0.5 to 0.9]

Fig. 11. Plot of completeness (green squares) and contamination
(red triangles) as a function of a threshold probability for BHB
classification in the case of the SVM classifier. The results shown
in Table 5 correspond to a probability threshold of 0.5. Output
probabilities have been modified by the 2d prior.
Fig. 12. Top: Completeness (green squares) and contamination
(red triangles) for sources of different magnitudes classified using
SVM with one hundred trials. The filled symbols and solid
lines show the results using the 2d prior. The open symbols
and dashed lines show the results using the simple prior only.
Bottom: Standard deviations of one hundred trials.
Table 5. Results for SVM classification.

Flat prior
  Absolute    bhb      other
  BHB         988      255
  OTHER       305      913
  Percent     bhb      other
  BHB         79.5%    20.5%
  OTHER       25.0%    75.0%
              Completeness   Contamination
  BHB         0.795          0.396

Simple prior
  Absolute    bhb      other
  BHB         836      407
  OTHER       188      1030
  Percent     bhb      other
  BHB         67.3%    32.7%
  OTHER       15.4%    84.6%
              Completeness   Contamination
  BHB         0.673          0.323

With 2d prior
  Absolute    bhb      other
  BHB         912      331
  OTHER       187      1031
  Percent     bhb      other
  BHB         73.4%    26.6%
  OTHER       15.4%    84.6%
              Completeness   Contamination
  BHB         0.73           0.303



Notes. Top section: Confusion matrix showing results of an SVM
classification without priors and with a threshold of P(BHB) > 0.5, also the
completeness and contamination in the output BHB sample, corrected 
for the expected class imbalance. Middle section: the same quantities 
obtained applying the simple prior probability for all sources. Bottom 
section: The confusion matrix and completeness and contamination 
found when applying the prior as a function of g and latitude. 

4.5. Optimal choice of classifier 

From the bare results in Tables 3, 4 and 5 using the 2d prior, all
the techniques have very similar completeness. The contamination
is best in the case of SVM with about 0.3, and worst for the
KDE with 0.4. The decision boundary method of Yanny et al.
(2000) should properly be compared with the simple prior case
for the three machine learning methods - the test set naturally 
has the right class fractions, but because the method is not prob- 
abilistic, the correction for the 2d prior cannot be made. The 
completeness of the decision boundary method is clearly better 
than the three machine learning methods. The contamination of 
over 50% is however worse than any of them. 

From the magnitude performance plots in Figs. 7, 10 and 12
it can be seen that all the methods achieve a high completeness
and low contamination for the approximate range 15 < g < 17.
The contamination achieved by the SVM technique for the region
of best performance between 15 < g < 17 is slightly better
than for the kNN.

To make a more direct comparison, we plot in Fig. 13 the
completeness and contamination, averaged over ten independent
trials, for all the methods as a function of the classification
threshold. For this plot, we restrict the test sample to sources in
the range 15 < g < 17, where all the methods perform reasonably
well.

Fig. 13. Top: Completeness (green) and contamination (red) for
test samples with 15 < g < 17 classified with the SVM, KDE
or kNN. Different shades and symbols are used to distinguish
the methods. All results are modified with the 2d prior and the
contamination is corrected for class fractions. Results are the
average of ten independent runs. Bottom: Same data as in the top
plot, with completeness plotted directly against contamination
for direct comparison. The SVM results are always below and to
the right of the other methods, demonstrating lower contamination
for a given completeness.

From this comparison, we can note the following. The SVM
and kNN methods deliver similar completeness over most of the 
range of thresholds. The kNN technique maintains completeness 
better than SVM for very high thresholds. However, the kNN 
method does not show any significant improvement in contami- 
nation, and it never delivers a better contamination than the other 
methods for similar completeness. The other techniques do show 
a falling contamination with increasing threshold. In the lower 
plot, it is clear that the SVM delivers on average a lower con- 
tamination for a given completeness. 

In summary, all the methods perform reasonably well, but 
the SVM seems to have the edge across the largest range of con- 
ditions, and we choose to use this technique on the new data. 






K.W. Smith et al.: Photometric identification of blue horizontal branch stars ** 



5. Classification of new data 




[Plot: fraction of sources passing the filter versus ν, from 0.00 to 0.10]

Fig. 14. The fraction of new data points passing through the one
class filter as a function of the parameter ν. The higher the value
of the parameter, the more objects are rejected, and fewer then
remain for the main classification. The fraction reaches a plateau
at around ν = 0.01.

We obtained new DR7 photometry from SDSS and used the
various models to predict the classes (BHB versus non-BHB).
The DR7 data were obtained using the colour cut of Yanny et al.
(see Sect. 2). The search yielded 859,341 objects at g < 23. This
magnitude cutoff is very deep, being 0.8 magnitudes deeper than
the 95% completeness limit for DR7 (Abazajian et al. 2009).
However, the selection requires good photometry in all five
SDSS bands, and the classification method will enforce the condition
that classified objects occupy the data space defined by
the training set (see next section), so that the number of spurious
objects in the sample will eventually be very low.
Fig. 15. (Top) g magnitude distribution of sources from DR7 selected
to lie in the same data space as the SVM training data.
(Bottom) g magnitude distribution of sources classified as BHB
stars.



5.1. One class filter

Before attempting to classify the new data, it is necessary to ex- 
clude points which lie outside the locus of the available training 
points. This issue did not arise when using the testing data as de- 
scribed previously, since all input sources are by definition part 
of the defined data set and could potentially be used to train a 
classifier. New points lying in a 'hinterland' outside the train- 
ing data locus and well away from the decision boundary may 
be misclassified with high confidence levels, since the probabil- 
ity model is based on distance from the decision boundary. It is 
necessary to exclude such points prior to attempting the classifi- 
cation. 

To do this, we used an SVM in one-class mode. The one-class
SVM defines a decision boundary which separates the
training data from the origin with a maximized margin. A parameter,
ν, controls the rigidity of the boundary, and hence what
fraction of the training set would typically be excluded. We collected
all the available training objects (2,536 BHBs and 7,511
non-BHBs) together into one set and standardized according to
these data. We conducted an experiment with different values of
the parameter ν to see how many of the new data points would
pass through. The results are shown in Fig. 14. From this figure,
we see that the fraction of sources passing through the filter
reaches a plateau for values of ν a little less than 0.01. We chose
ν = 0.01 based on this fact, and on a visual inspection of the
region of the data space in u - g, g - r space occupied by the
surviving points.

Filtering the photometry according to consistency with the 
training set, we were left with 294,652 objects. The distribution 
of these in g magnitude is shown in Fig. 15 (left hand side).

5.2. Classification 

For the classification, we trained a new SVM model using all 
the available BHB stars, plus an equal number of randomly se- 
lected non-BHB stars. The 2d prior probability was used to ob- 
tain posterior probabilities, and a threshold of 0.5 was applied 
to these. With this threshold, 27,074 of the new sample objects 
were classified as BHB stars. Figure 16 shows the probabilities
output from the SVM classifier plotted against the probabilities
modified by the 2d prior. The threshold for BHB classification
is shown by the horizontal line. The threshold obtained by assuming
a prior probability P(BHB) = 0.32 for all sources is
shown with the vertical line. It can be seen that there are peaks
in source density at low and high probability, so that the choice
of prior does not dominate the classification. The distribution of
classified BHB stars in g magnitude is shown in the right hand
side of Fig. 15.
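
The step from classifier output to posterior probability can be sketched as below. The sketch assumes the standard Bayes reweighting for a classifier trained on equal class sizes (that is, with an implicit flat prior); the exact expression used for the results in this paper is defined where the priors are introduced, and the function name here is hypothetical.

```python
def apply_prior(p_classifier, prior):
    """Posterior P(BHB) from a flat-prior classifier output and a prior.

    Assumes the classifier was trained with equal class sizes, so that
    its output is proportional to the likelihood ratio alone.
    """
    bhb = p_classifier * prior
    other = (1.0 - p_classifier) * (1.0 - prior)
    return bhb / (bhb + other)


# With the simple prior P(BHB) = 0.32, a raw output of 0.68 sits
# exactly on the posterior threshold of 0.5 under this assumption.
print(apply_prior(0.68, 0.32))
print(apply_prior(0.90, 0.32))
```

Under this assumption the simple prior is equivalent to raising the threshold on the raw SVM probability from 0.5 to 0.68, while the 2d prior shifts the equivalent threshold source by source according to g magnitude and latitude.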
5.2.1. Stability of the probabilities

The training set for the classifier is composed of all the available 
BHB stars, together with an equal number of non-BHB stars se- 
lected at random. This means in practice about one third of the 
non-BHB stars are included in the training set. The output probabilities,
and in many cases the final classifications, will therefore
depend on the exact choice of training data. To quantify
the stability of the output probabilities, we performed ten resamplings
of the training data and subsequent classifications, and
found the standard deviations of the output probabilities.

In Fig. 17 we show a histogram of the standard deviations
of the probabilities, and a plot of the standard deviation of the
probability for each output source versus the mean value of the
probability obtained. There is a broad peak at about 0.015. The
histogram has been truncated at P = 0.1.





[Plot axis labels recovered from the figures: "P(BHB): SVM output" (Fig. 16); "Standard Deviation of P(BHB)", from 0.00 to 0.10 (Fig. 17)]



Fig. 17. Histogram of the standard deviations of the probabilities 
output by the SVM classifier over ten classifications trained on 
ten separate resamplings of the available training data. 

Table 6. Probability standard deviations for BHB and non-BHB
classes.

  SD >     BHB     non-BHB
  0.01     72%     77%
  0.02     34%     55%
  0.05     16%     29%
  0.10     10%     14%



In Table 6 we show the percentages of objects classified as
BHBs or non-BHBs with standard deviation in the probability
exceeding 0.01, 0.02, 0.05 and 0.1. The peak occurs between 0.01 and
0.02 (see also Fig. 17). For each threshold, fewer BHB sources than
non-BHB sources have standard deviations higher than the given
threshold.

It is difficult to combine repeated probabilities into a single
value with an uncertainty, and also to include the effect of the
magnitude and latitude dependent prior, and we do not attempt
to do this. Instead, we give in the output table (Table 7) the
raw SVM output probability from one classification, the standard
deviation on this from ten resamplings, the prior, and the
posterior probability obtained by applying the prior to the raw
SVM probability. We use the condition that the posterior probability
P(BHB) > 0.5 for BHB classification, but alternatively
one could impose some condition based on the standard error
for each object.

Fig. 16. Classification probabilities P(BHB) for the new sources
in the DR7 data set. The abscissa shows the probability returned
by the SVM classifier, the ordinate shows the probability after
modification with the two dimensional prior (a function of latitude
and g magnitude). The horizontal line marks the P = 0.5
threshold above which sources were classified as BHB stars for
the purposes of this work. The vertical line shows the equivalent
threshold for the SVM raw probabilities assuming a simple
prior of P(BHB) = 0.32 for all sources. Contours are at 0.004,
0.006, 0.01, 0.03, 0.09 and 0.12 times the maximum density. The
highest density regions lie in the bottom left, top right and in the
clump at approximately (0.8,0.8).



5.3. Photometric distances 



Sirko et al. (2004) give absolute g magnitudes based on models
by Dorman et al. (1993), for a range of BHB star properties
(Teff, log g and metallicity; their Table 2), together with u - g,
g - r and g - i colours. To determine photometric distances for
our BHB stars, we perform a regression based on these data. We
do this with a support vector machine in regression mode.*

We estimate the distance errors due to the uncertainty in
the photometry by recomputing the distances with ±1σ in the
colours and in g. The distributions are shown in Fig. 18.
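
Once an absolute magnitude has been assigned, the distance follows from the distance modulus, and the sensitivity to photometric error can be checked by shifting the apparent magnitude. The sketch below assumes dereddened magnitudes and takes the absolute magnitude as a given input rather than performing the regression on colours; the function names and numerical values are hypothetical.

```python
def distance_kpc(g_apparent, g_absolute):
    """Distance from the distance modulus g - M_g = 5 log10(d / 10 pc)."""
    distance_pc = 10.0 ** ((g_apparent - g_absolute + 5.0) / 5.0)
    return distance_pc / 1000.0


def fractional_change(g_apparent, g_absolute, offset_g):
    """Fractional change in distance when g is shifted by offset_g."""
    base = distance_kpc(g_apparent, g_absolute)
    shifted = distance_kpc(g_apparent + offset_g, g_absolute)
    return (shifted - base) / base


# Illustrative values only: g = 15.0 and M_g = 0.0 give exactly 10 kpc.
print(distance_kpc(15.0, 0.0))
print(fractional_change(15.0, 0.0, 0.02))
```

A shift of 0.02 mag in g changes the distance by a little under one per cent, independent of the distance itself, since the fractional change depends only on the size of the offset.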

Figure 19 shows the distribution of BHB stars on the sky in
the region of the north Galactic cap. The locations of a selection
of globular clusters taken from the list of Harris (1996) are
shown as black circles. These were used to make a test of the
BHB distances, as described below. Also indicated, with a box,
is the location of the Ursa Minor dwarf galaxy. The BHB population
of this galaxy is clearly visible as a clump of distant stars.

5.4. A distance test 

To help assess the accuracy of our BHB identifications and distance
determination, we compare our data to a selection of globular
clusters taken from the catalogue of Harris (1996) and selected
to lie in the north Galactic cap region. Their positions are



* The SVM for regression fits a regression line in a high dimensional
feature space, rather than a classification boundary. This involves
an extra parameter which must be tuned. For a full description see
Drucker et al. (1996) or the libSVM documentation.
Fig. 18. The fractional change in the derived distance caused by applying a random offset to either the g magnitude or the colours.
The offsets are drawn from a Gaussian distribution with the appropriate σ for each object. On the left, the effect of the colour
uncertainty on the fitting result, on the right, the effect of the photometric error in g. Both distributions have a width of the same
order of magnitude.




[Sky map: horizontal axis RA [deg]]

Fig. 19. BHB stars in the north Galactic cap region, shown in Aitoff projection. The stars have been colour coded for distance as
follows: blue = closer than 15 kpc, green = 15-40 kpc, red = further than 40 kpc. The positions of a few globular clusters selected from
the catalogue of Harris (1996) and used for a distance test are shown as black circles. The position of the Ursa Minor dwarf
galaxy is marked with a box.
[Plot: horizontal axis Cluster distance [kpc]]

Fig. 20. Distances to globular clusters taken from the catalogue of Harris compared to the mean distance to BHB stars within one
tidal radius of the cluster centre (both quantities taken from Harris). The straight line shows exact agreement (x = y). Error bars are
plotted where possible (> 1 source identified). The main plot shows the full sample, the inset shows the portion closer than 30 kpc
at a larger scale. Globular clusters with no detected BHB population are not shown.



shown in Fig. 19. We considered all BHB stars within half a
tidal radius of each cluster centre to be probable cluster members,
and determined the mean distance to those stars, and the
error on the mean. We then compare those distances with the
ones given in the table of Harris. Out of fifty-two globular clusters
within the north Galactic cap region, sixteen had at least one
BHB star from our sample within half a tidal radius of the centre.
Figure 20 shows the mean distance for each cluster derived
from the BHB population compared to the accepted distance
given in the catalogue. The agreement with the distances taken
from Harris is generally reasonably good for clusters around
20 kpc distant. Amongst the nearby clusters are several with
overestimated distances, two of which contain only one source
each. Overestimated distances would be expected if the cluster
membership is contaminated with non-BHB stars, since the
contaminants are generally intrinsically fainter than the BHB stars
and so are assigned distances that are too large. A more distant
cluster at just over 80 kpc also has an overestimated distance.

In Fig. 21 we use the information from the globular cluster
distance test to further investigate the performance of the classification.
We select all sources within half a tidal radius of a
cluster centre and calculate the distance to these assuming they
are BHB stars (which they will not all be). We then find the fractional
absolute residual, R, between this distance, which we call
d_*, and the accepted cluster distance d_0,

$$R = \frac{|d_* - d_0|}{d_0}$$



We use the absolute residual because the vast majority of sources 
that have distances inconsistent with the cluster distance are 
placed on the too distant side (most contaminants will be intrinsi- 
cally fainter than BHB stars). We plot this against the (prior cor- 
rected) SVM probability P(BHB). Sources with P(BHB) > 0.5 
are classified as BHB stars for the purposes of this plot. 
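
The residual defined above is straightforward to compute, and the comparison with the probabilities can be sketched as a simple tally. The tolerance used below to call a distance consistent with cluster membership, the function names and the toy sources are all hypothetical; no such cut is specified in the text.

```python
def fractional_residual(d_star, d_cluster):
    """Fractional absolute residual R = |d_* - d_0| / d_0."""
    return abs(d_star - d_cluster) / d_cluster


def tally(sources, d_cluster, tolerance=0.3, threshold=0.5):
    """Count sources by classification and distance consistency.

    sources is a list of (distance assuming BHB [kpc], posterior P(BHB)).
    The tolerance on R is an illustrative choice only.
    """
    counts = {"bhb_consistent": 0, "bhb_inconsistent": 0,
              "other_consistent": 0, "other_inconsistent": 0}
    for d_star, p_bhb in sources:
        label = "bhb" if p_bhb > threshold else "other"
        near = fractional_residual(d_star, d_cluster) <= tolerance
        counts[label + ("_consistent" if near else "_inconsistent")] += 1
    return counts


# Toy sources around a cluster at 20 kpc, for illustration only.
toy = [(19.0, 0.9), (21.5, 0.8), (60.0, 0.1), (45.0, 0.7), (22.0, 0.3)]
print(tally(toy, d_cluster=20.0))
```

In this picture, sources counted as "bhb_inconsistent" correspond to the candidates for false positives or for foreground and background BHB stars.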

In Fig. 21, sources which are true BHB stars within the
cluster should appear close to the cluster distance. We would 
like to see as many as possible appearing with high probabil- 
ity P(BHB). Sources which are in the cluster but are not BHB 
stars should be assigned distances greater than the true cluster 
distance, as they are intrinsically fainter. Ideally, these sources 
should of course have P(BHB) < 0.5. BHB stars that are not 
genuine cluster members could be in the foreground or the back- 
ground, so could appear more or less distant than the true cluster 
distance. Non-BHB stars in the foreground or background could 
appear at greater or lesser distance than the cluster, but their ap- 
parent distance would be greater than their true distance due to 
their lower luminosity. 

We can see in the figure that there is a clump of appar- 
ent BHB stars at the cluster distance and with high probabil- 
ity P(BHB). The majority of the sources with incompatible dis- 
tances for cluster membership also have P(BHB) < 0.5. There
are a few sources with P(BHB) > 0.5 and incompatible dis- 
tances. These are probably false positive misclassifications, al- 
though we cannot rule out the possibility that they are simply 
foreground or background BHB stars. 
[Plot: horizontal axis Fractional distance residual]

Fig. 21. An analysis of classifier performance based on likely
membership of known globular clusters. Sources are selected
based on proximity to a known globular cluster from the catalogue
of Harris (1996). Distances are computed for all these
sources, assuming they are all BHB stars, and compared to the 
catalogue distance for the cluster. True BHB stars that are re- 
ally cluster members should return a distance consistent with 
the cluster distance. Most contaminants from within the cluster 
will appear too distant compared to the BHB population, because 
they are fainter. This plot shows the fractional absolute resid- 
ual in the distance estimate in kpc versus the (prior-corrected) 
SVM probability P(BHB). The dashed line shows the threshold 
P(BHB) = 0.5. 



6. A catalogue of BHB stars from DR7 photometry 

In Table 7 we give the basic data for the sample of DR7 sources
classified by us. The first four columns of this table show various
IDs from SDSS. Column one is the PhotObjId (long) from
the SDSS PhotObj table. Columns two to four are the plate ID,
MJD, and fiber ID for spectroscopic observations (where available)
from the SDSS SpecObj table. Columns five to eight are
the RA, Dec, l, and b in degrees. Columns nine through thirteen
show the SDSS photometry (u, g, r, i, z psfMags). Columns
fourteen through eighteen show the errors in the photometry
in magnitudes. Columns nineteen through twenty-three show
the extinction in each band. Column twenty-four lists the category
from Xue et al. This can be either 'BHB', meaning BHB
star from the D0.2, fm method, confirmed by the cγ, bγ method;
'Other', meaning BHB star from the D0.2, fm method, rejected by
the cγ, bγ method; 'BS', meaning blue straggler from the D0.2, fm
method; 'MS', meaning main sequence star from the D0.2, fm method; or
'None', meaning not present in the Xue et al. catalogue. Column
twenty-five shows the raw output probability from the SVM.
Column twenty-six shows the standard deviation of this probability
over ten trials with resampled training sets. Column twenty-seven
shows the (2d) prior used. Column twenty-eight shows
the posterior probability calculated from the SVM probability
by applying the prior. Columns twenty-nine and thirty show the
assigned distance in kpc and the fractional error.

In the Appendix, we give all the information needed to di- 
rectly apply the SVM one-class filter and the two-class classifier 
to new data for which the SDSS colours, u - g, g - r, r - i and
i - z are available.



6.1. A warning on extinction 

The classification is based on dereddened magnitudes, and the
dereddening is performed by the SDSS pipeline based on the
map of Schlegel et al. (1998). This is expected to work well at
high Galactic latitudes, but for sources in the disk the extinctions 
may not be reliable. Furthermore, these maps give the line-of- 
sight extinction to the edge of the Galaxy, so they will underesti- 
mate extinction in all cases, possibly by a non-negligible amount 
even at high latitudes for nearby BHBs. 

The catalogue we present contains 181,022 sources with
|b| < 10° out of the total of 294,652 sources. Out of the probable
BHB stars, 7,231 sources have P(BHB) > 0.5 and |b| < 10°,
and 19,843 sources have P(BHB) > 0.5 and |b| > 10°. Users
should be aware of this issue when considering sources at low 
Galactic latitude. 

To quantify this effect, we calculated the completeness and 
contamination in the test sample as a function of absolute 
Galactic latitude. The results are shown in Fig. 22. From this,




[Plot: completeness and contamination versus |b| [deg]]

Fig. 22. An analysis of classifier performance on the Xue et al.
testing set as a function of absolute Galactic latitude, |b|. The
completeness (green squares) and contamination (red triangles)
are plotted. The sample was restricted to sources with g < 17.5.
The 2d prior was used, and the contamination is corrected for
the likely class fractions.



we can see that the performance of the classifier holds up well 
for \b\ > 30°. Below that, there is some degradation, and the per- 
formance becomes quite bad in the lowest bin, with \b\ < 18°. It 
is difficult to assess the detailed behaviour of the classifier here 
because of the small number of test sources available (there are 
6 3 sources in the first bin, of which 18 were BHB stars according 
to lXue et all) . 
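The latitude-binned statistics discussed above can be sketched as follows. This is a minimal illustration, not the code used for Fig. 22: the function names are hypothetical, contamination is assumed here to mean the fraction of the output BHB sample that is not BHB according to the reference labels, and the correction for the likely class fractions is not applied.

```python
import numpy as np


def completeness_contamination(is_bhb_true, is_bhb_pred):
    """Return (completeness, contamination) for boolean label arrays."""
    truth = np.asarray(is_bhb_true, dtype=bool)
    pred = np.asarray(is_bhb_pred, dtype=bool)
    n_true = int(truth.sum())           # reference BHB stars
    n_pred = int(pred.sum())            # sources classified as BHB
    n_both = int((truth & pred).sum())  # reference BHB stars recovered
    completeness = n_both / n_true if n_true else float("nan")
    # Assumed definition: non-BHB fraction of the output BHB sample.
    contamination = (n_pred - n_both) / n_pred if n_pred else float("nan")
    return completeness, contamination


def performance_by_latitude(b_deg, is_bhb_true, is_bhb_pred, bin_edges):
    """Evaluate both statistics in bins of |b| (bin edges in degrees)."""
    abs_b = np.abs(np.asarray(b_deg, dtype=float))
    truth = np.asarray(is_bhb_true, dtype=bool)
    pred = np.asarray(is_bhb_pred, dtype=bool)
    rows = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (abs_b >= lo) & (abs_b < hi)
        comp, cont = completeness_contamination(truth[in_bin], pred[in_bin])
        rows.append((lo, hi, int(in_bin.sum()), comp, cont))
    return rows
```

With only a few tens of sources in a bin, as in the lowest-latitude bin, both ratios move by several per cent when a single source changes class, which is why the detailed behaviour there is hard to assess.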



7. Conclusions 

Starting with a sample of spectroscopically identified BHB stars
published by Xue et al. (2008), we have trained a number of
standard machine learning algorithms to distinguish BHB stars
from contaminating main sequence stars or other interlopers,
using SDSS colours alone. We have investigated three methods,
with and without the use of probabilistic classification and
prior probabilities, and we find that the support vector machine
offers the best completeness while simultaneously minimizing
the contamination in the output sample. The kernel density
estimator was able to provide comparable contamination, but with
a lower completeness. The kNN method was able to match the
completeness of the SVM, but not the contamination. Adjusting
the classification thresholds altered this picture in various ways,
but the SVM generally outperformed the other techniques.

Using the most promising technique (SVM), we have classified
a large sample of DR7 data selected to lie within the colour
box of Yanny et al. (2000). This sample comprises 859,341
sources. We used a one-class filter (also based on an SVM)
to select 294,652 of these as lying in the same colour space
as the available training set. We have identified 27,074 of these
as probable BHB stars. This includes any already identified by
Xue et al. (2008). Because our classifier relies on a randomly
selected subsample of the available training objects, we ran
multiple classifications to quantify the stability of the output
probabilities. The standard deviations of the output probabilities
are also provided in the table.
Table 7. Results of classification of new data.

ObjID                     RA         Dec           l           b      u      g      r      i      z    Δu    Δg    Δr    Δi    Δz  u_ext  g_ext  r_ext  i_ext  z_ext  Type  P_SVM    ΔP  prior  P_post  d (kpc)  Δd/d
587747119446491418  0.010639   -4.960869   91.740344  -64.673367  23.62  22.65  22.85  22.20  22.19  0.92  0.20  0.26  0.24  0.71   0.15   0.11   0.08   0.06   0.04  None   0.01  0.01   0.23   0.004   255.42  0.26
587740586797105255  0.010989   23.874033  108.078305  -37.510853  19.70  18.43  18.39  18.40  18.39  0.03  0.01  0.01  0.02  0.03   0.57   0.42   0.30   0.23   0.16  None   0.34  0.01   0.25    0.15    31.01  0.01
758874300138651868  0.031694   34.673966  111.104952  -27.016593  16.21  15.09  15.14  15.27  15.29  0.01  0.01  0.01  0.01  0.01   0.41   0.30   0.22   0.16   0.11  None   0.19  0.02   0.43    0.15     6.99  0.01
587740589481525502  0.032272   26.043768  108.765105  -35.412494  21.14  19.87  20.01  20.14  20.14  0.09  0.01  0.02  0.03  0.11   0.21   0.15   0.11   0.08   0.06  None   0.79  0.01   0.14    0.41    68.92  0.02
758874298530595227  0.040556   28.786874  109.563938  -32.750327  20.98  19.99  20.03  20.07  20.14  0.06  0.02  0.02  0.02  0.08   0.30   0.22   0.16   0.12   0.08  None   0.27  0.04   0.14    0.05    69.91  0.01
587747072734134392  0.044479   -4.343054   92.464250  -64.139463  21.72  20.38  20.59  20.78  20.69  0.17  0.02  0.04  0.05  0.16   0.19   0.14   0.10   0.07   0.05  None   0.99  0.38   0.23    0.96    87.27  0.04
758874297994576583  0.046331   26.788068  108.999375  -34.693214  23.38  22.15  22.12  21.68  21.66  0.55  0.06  0.08  0.08  0.28   0.25   0.18   0.13   0.10   0.07  None   3.00  3.74   0.14    5.18   188.11  0.13
587727178449485952  0.047126  -10.668292   84.249601  -69.593383  20.00  18.82  18.82  18.92  18.88  0.05  0.02  0.02  0.01  0.05   0.17   0.12   0.09   0.07   0.05  None   0.25  0.04   0.25    0.10    42.72  0.01
587727225690128568  0.049595  -10.536831   84.467652  -69.486246  22.68  21.54  21.54  21.16  21.47  0.57  0.09  0.14  0.16  0.60   0.17   0.13   0.09   0.07   0.05  None   0.29  0.30   0.25    0.12   150.65  0.14
587747117302546716  0.052484   -1.704633   95.003385  -61.766531  21.88  20.96  20.95  20.95  20.80  0.23  0.04  0.05  0.08  0.25   0.17   0.12   0.09   0.07   0.05  None   0.98  0.37   0.22    0.95   113.95  0.06
587740587334041618  0.066353   24.353251  108.289338  -37.059177  19.45  18.07  17.97  17.99  17.99  0.03  0.01  0.01  0.02  0.02   0.49   0.36   0.26   0.20   0.14  None   0.66  0.06   0.29    0.45    27.14  0.01
587747119446425658  0.067879   -4.793711   92.038001  -64.551268  18.36  17.19  17.41  17.56  17.70  0.02  0.01  0.02  0.01  0.02   0.16   0.11   0.08   0.06   0.04  None   0.74  0.01   0.50    0.74    20.34  0.01
588015508195639503  0.070922   -0.669070   95.923795  -60.829836  21.76  20.65  20.70  20.70  20.26  0.23  0.03  0.04  0.05  0.15   0.21   0.15   0.11   0.08   0.06  None   0.24  0.38   0.22    0.08    97.66  0.06
758874298530726084  0.087467   28.430939  109.512049  -33.105980  19.83  18.70  18.68  18.56  18.42  0.03  0.01  0.02  0.02  0.03   0.31   0.23   0.16   0.12   0.08  None   0.87  0.31   0.16    0.57    38.58  0.01
587740525078905365  0.088007   25.819898  108.758217  -35.641838  24.00  22.85  22.83  22.78  23.00  1.26  0.16  0.24  0.32  0.62   0.21   0.15   0.11   0.08   0.06  None   0.46  0.11   0.14    0.13   274.36  0.34
587731186740822469  0.091604    0.301938   96.745254  -59.947468  23.85  22.90  23.06  23.27  23.61  0.94  0.15  0.25  0.51  0.70   0.14   0.10   0.07   0.06   0.04  None   0.06  0.12   0.21    0.01   269.88  0.27
587747073270808594  0.107236   -3.878229   93.066164  -63.752046  19.59  18.48  18.60  18.77  18.88  0.04  0.01  0.01  0.02  0.03   0.19   0.14   0.10   0.08   0.05  None   0.24  0.01   0.30    0.12    36.27  0.01
587727179523227719  0.110229   -9.809412   85.743476  -68.913977  18.38  17.29  17.31  17.37  17.46  0.02  0.01  0.01  0.01  0.02   0.18   0.13   0.09   0.07   0.05  None   0.13  0.00   0.51    0.14    21.04  0.01
758874299066679584  0.122841   30.574360  110.133990  -31.028123  19.39  18.31  18.27  18.25  18.32  0.02  0.01  0.01  0.01  0.03   0.24   0.17   0.12   0.09   0.06  None   0.26  0.00   0.22    0.09    32.54  0.01
587727179523293190  0.123358   -9.954755   85.551926  -69.042703  16.21  15.12  15.27  15.45  15.58  0.01  0.01  0.01  0.01  0.01   0.19   0.14   0.10   0.07   0.05  None   0.33  0.01   0.64    0.48     7.67  0.01
587747119446622618  0.220188   -5.260937   91.849680  -65.035920  23.44  22.45  22.41  22.41  22.25  0.76  0.15  0.17  0.30  0.66   0.16   0.12   0.08   0.06   0.04  None   0.99  0.33   0.23    0.99   223.05  0.20
758874370997223595  0.222175   27.058341  109.262490  -34.468046  16.72  15.65  15.61  15.61  15.69  0.01  0.01  0.01  0.01  0.02   0.21   0.16   0.11   0.08   0.06  None   0.27  0.01   0.52    0.29     9.68  0.01
588015507658833969  0.226867   -1.232878   95.745605  -61.408443  19.29  18.14  18.15  18.23  18.33  0.03  0.02  0.01  0.02  0.03   0.21   0.15   0.11   0.08   0.06  MS     0.11  0.00   0.35    0.06    30.73  0.01
587727225690259621  0.233561  -10.446949   85.039354  -69.512947  20.47  19.35  19.40  19.49  19.39  0.08  0.02  0.01  0.02  0.08   0.21   0.15   0.11   0.08   0.06  None   0.08  0.05   0.25    0.02    53.28  0.02
588015508195704946  0.234491   -0.699505   96.206318  -60.923488  20.82  19.77  19.72  19.81  19.86  0.10  0.03  0.02  0.02  0.10   0.21   0.16   0.11   0.08   0.06  MS     0.25  0.01   0.22    0.08    65.31  0.03
758874297994641467  0.240442   26.679747  109.172876  -34.839752  18.84  17.68  17.82  17.96  18.08  0.02  0.01  0.01  0.01  0.02   0.19   0.14   0.10   0.07   0.05  None   0.39  0.01   0.30    0.22    24.98  0.01
587747119446753643  0.244419   -5.568110   91.561486  -65.319293  22.96  21.74  21.81  21.58  21.59  0.58  0.08  0.10  0.14  0.45   0.18   0.13   0.09   0.07   0.05  None   0.01  0.01   0.23    0.00   158.50  0.14
758874297994838272  0.256721   26.225998  109.058576  -35.283989  20.71  19.61  19.61  19.65  19.67  0.06  0.01  0.01  0.02  0.06   0.21   0.15   0.11   0.08   0.06  None   0.20  0.00   0.14    0.04    60.71  0.01
588015509806383417  0.287599    0.598832   97.334501  -59.750082  23.28  22.40  22.44  22.46  22.64  0.58  0.12  0.17  0.29  0.75   0.13   0.09   0.06   0.05   0.03  None   0.02  0.04   0.21    0.00   225.32  0.15
587747121592730035  0.291503   -1.744949   95.428721  -61.902162  22.64  21.62  21.68  21.69  21.25  0.46  0.06  0.11  0.17  0.38   0.20   0.15   0.10   0.08   0.05  None   0.50  0.40   0.22    0.22   145.68  0.11
587743959426203774  0.297441    7.314223  101.682300  -53.471631  17.74  16.54  16.51  16.55  16.59  0.01  0.01  0.01  0.02  0.02   0.32   0.23   0.17   0.12   0.09  None   0.24  0.00   0.56    0.29    14.21  0.01
587727225153388676  0.307298  -10.991496   84.330154  -70.004463  20.13  18.86  18.94  19.06  19.03  0.05  0.02  0.01  0.02  0.06   0.19   0.13   0.10   0.07   0.05  None   0.71  0.03   0.25    0.45    43.32  0.01
588015509806383122  0.308759    0.537564   97.326490  -59.814775  17.38  16.13  16.20  16.33  16.39  0.01  0.02  0.01  0.01  0.02   0.13   0.09   0.07   0.05   0.03  BHB    0.74  0.01   0.60    0.82    12.52  0.01

Notes. A selection of the first few rows is shown. Some rows were omitted so that some objects with corresponding spectra are shown.
We used photometric parallaxes derived from colour data
presented in Sirko et al. (2004) to derive distances for these
objects, using another variant of the support vector machine to
make the fit to the colours. We performed a few simple checks
on these distances, and on the spatial distribution of the classified
BHB stars, to demonstrate that our method is reasonable.

We include along with this work a catalogue of the 294,652 
DR7 sources together with probabilistic identifications as BHB 
stars, in the hope that these can be useful for other workers ei- 
ther directly as a ready made BHB sample, or as prior proba- 
bilities for spectroscopic BHB identification methods. We also 
provide, in the appendix, the data and parameters necessary to 
apply our classification to new colour data. The accuracy of the 
catalogue, or the classifier, can be estimated by reference to the 
various test results presented in the main body of the paper. In 
particular, Fig. Q~2] gives the estimated performance as a function
of magnitude, although the reference classifications from
the Xue et al. (2008) catalogue are unreliable for g > 19, as seen
from Figs. 2 and 3. Figure QT| gives the expected effect of changing
the required threshold probability for BHB classification, whilst
Fig. 22 can be used to estimate the performance as a function of
Galactic latitude.

A general conclusion of this work is that, where reliable 
training sets can be identified, machine learning approaches such 
as those discussed here can probably extract more information 
than is available with simple colour cuts or ad hoc models. This 
type of approach is likely to be very fruitful in the future for 
surveys yielding large photometric datasets. 
Appendix A: Applying the SVM model directly

The application of the SVM model is mathematically straight- 
forward and not excessively laborious. We therefore give here 
the full specification of the SVM classification so that it can be 
directly applied to new data. 

The input data are the four dereddened SDSS colours u - g,
g - r, r - i, i - z. The recipe for applying the model consists
of five main steps, listed here. Detailed instructions for each
step, together with tables containing the necessary model data,
follow in the subsections below. The steps are:

1. Apply the one-class model standardization to the data.

2. Evaluate the one-class model and reject outliers.

3. Apply the two-class model standardization to the original
data.

4. Apply the two-class model to obtain the decision value, f.

5. Apply the probability model to convert f into P(BHB).

This recipe leaves one with an SVM probability implicitly 
assuming that BHB stars and non-BHB stars are equal in number 
in the input sample. An appropriate prior should be applied to 
obtain the posterior probability. 

A.1. Apply the one-class model standardization

The equation for the standardization is

x_s = \frac{x - \mu}{\sigma},    (A.1)

where x are the colours, μ are the means and σ the standard
deviations of each colour. For the one-class model, the
standardization is performed using the one-class values of μ and σ
given in Table A.1.
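As an illustration, Equation A.1 can be applied with a few lines of NumPy. This is a sketch rather than the authors' code: the function and constant names are hypothetical, and the numerical values are the means and standard deviations as read from Table A.1, in the colour order (u - g, g - r, r - i, i - z).

```python
import numpy as np

# Values as read from Table A.1, ordered (u-g, g-r, r-i, i-z).
ONE_CLASS_MU = np.array([1.11589420, -0.11549278, -0.09550841, -0.09199681])
ONE_CLASS_SIGMA = np.array([0.09485993, 0.08260171, 0.21371742, 0.15484431])
TWO_CLASS_MU = np.array([1.12810095, -0.12914688, -0.10735233, -0.09247437])
TWO_CLASS_SIGMA = np.array([0.09391075, 0.08442376, 0.18161768, 0.10137230])


def standardize(colours, mu, sigma):
    """Apply x_s = (x - mu) / sigma to an (n_sources, 4) array of colours.

    A single source given as a length-4 sequence is promoted to shape (1, 4).
    """
    colours = np.atleast_2d(np.asarray(colours, dtype=float))
    return (colours - mu) / sigma


# Hypothetical dereddened colours for one source, for illustration only.
example_colours = [1.15, -0.20, -0.12, -0.08]
x_one_class = standardize(example_colours, ONE_CLASS_MU, ONE_CLASS_SIGMA)
```

Keeping the one-class and two-class parameters as separate arrays makes it harder to apply the wrong standardization at the later two-class step, which must start again from the original dereddened colours.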
Table A.1. Standardization parameters for both one-class and
two-class classifiers.

                 u - g        g - r        r - i        i - z
one-class
  μ         1.11589420  -0.11549278  -0.09550841  -0.09199681
  σ         0.09485993   0.08260171   0.21371742   0.15484431
two-class
  μ         1.12810095  -0.12914688  -0.10735233  -0.09247437
  σ         0.09391075   0.08442376   0.18161768   0.10137230


Table A.3. Data for the one-class SVM model.

 y_i α_i       u - g       g - r       r - i       i - z
0.177629   -0.125387    -3.43222    2.042456    0.187264
0.853943   -2.771393   -3.613814    0.063207   -0.607082
 0.11659    0.665252    0.272304   -1.799065    2.331353
0.388386    0.412248   -0.926218   -0.278366   -3.513227
0.255163   -0.863317   -0.345117   -1.129957    0.309968
       1    2.562787    -3.10535   -0.250291   -0.258345
0.382262   -0.958194     0.70813   -1.986228    2.899666
0.281138   -0.515436   -2.064209   -0.535715   -2.544512
       1   -1.327159   -1.822084   -2.210824    1.937409


Table A.2. Parameters for one-class, two-class and probability
models.

One-class model
  γ    0.25
  ρ    2.571136
Two-class model
  γ    0.0625
  ρ    5.116558
Probability model
  A    -1.162976
  B    0.006035218



A.2. Application of one-class SVM model 

The evaluation equation for the SVM model, for either one-class
or two-class classification, is

f = \sum_{i=1}^{N_s} y_i \alpha_i K(\mathbf{x}, \mathbf{s}_i) - \rho,    (A.2)

where x is the colour vector to be classified, s_i are the support
vectors, α_i their fitted weights, y_i are class labels for each
support vector, and ρ is a constant offset applied to each result.
The value of N_s is 152 for the one-class model (Table A.3) and
N_s = 2645 for the two-class model (Table A.4). The class labels
y_i are set to +1 or -1 for the two-class classifier, and are always
set to +1 for the one-class case.

K is a kernel function, in our case an RBF kernel, given by

K(\mathbf{x}, \mathbf{s}_i) = \exp(-\gamma \|\mathbf{x} - \mathbf{s}_i\|^2),    (A.3)

where γ is a parameter found by tuning (Table A.2).

The values of the support vectors for the one-class model,
corresponding to the vectors labeled s_i in Equations A.2 and A.3,
are given in Table A.3, which is available in its full form as an
e-table. The first column in this table gives the product y_i α_i
for each vector. The values of the parameters γ and ρ are given in
Table A.2.

To apply the one-class model, simply calculate the sum in
Equation A.2 over all the support vectors in Table A.3 and subtract
the value of ρ. Sources with f > 0 are compatible with the
training data and are therefore suitable for classification with the
two-class classifier. Sources with f < 0 are outliers that should
be rejected (they cannot be classified with the two-class model).
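Equations A.2 and A.3 translate directly into code. The sketch below is illustrative only: the function names are hypothetical, γ and ρ are the one-class values quoted in Table A.2, and the two support vectors are made-up toy values, not rows of Table A.3. In real use the full electronic table of 152 support vectors would be loaded instead.

```python
import numpy as np


def rbf_kernel(x, support_vectors, gamma):
    """K(x, s_i) = exp(-gamma * ||x - s_i||^2) for every support vector s_i."""
    x = np.asarray(x, dtype=float)
    sv = np.atleast_2d(np.asarray(support_vectors, dtype=float))
    return np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))


def decision_value(x, coeffs, support_vectors, gamma, rho):
    """f = sum_i (y_i alpha_i) K(x, s_i) - rho, as in Equation A.2.

    `coeffs` holds the products y_i * alpha_i (first column of the
    support-vector tables); `x` must already be standardized.
    """
    k = rbf_kernel(x, support_vectors, gamma)
    return float(np.dot(np.asarray(coeffs, dtype=float), k) - rho)


# One-class parameters quoted in Table A.2.
GAMMA_ONE_CLASS = 0.25
RHO_ONE_CLASS = 2.571136

# Hypothetical toy data: two made-up support vectors, NOT rows of Table A.3.
toy_sv = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
toy_coeffs = [1.0, 1.0]
f = decision_value([0.0, 0.0, 0.0, 0.0], toy_coeffs, toy_sv,
                   GAMMA_ONE_CLASS, RHO_ONE_CLASS)
is_inlier = f > 0  # sources with f > 0 would pass the one-class filter
```

The same `decision_value` function serves the two-class step, with the two-class γ and ρ and the coefficients and support vectors from the two-class table.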

A.3. Standardization for two-class classification 

Having rejected sources not compatible with the model, it is now
necessary to standardize the data for the surviving sources using
the standardization appropriate for the two-class classifier. The
equation for this is identical to that used for the one-class
standardization, Equation A.1 above. The parameters are given in
Table A.1. Note that this standardization should be performed
on the original dereddened colours, not on the standardized data
used for the one-class model.

Notes (Table A.3). Only the first ten lines of this table are shown
here for illustration, the remainder is available in electronic form.
There are a total of 152 support vectors in the online table. Column
one is the product y_i α_i, columns two to five are the dereddened,
standardized colours.



A.4. Application of the two-class model 

The equations for the two-class model are the same as for
the one-class, namely A.2 and A.3 above. The data for the
model should be taken from Table A.4 (support vectors) and
from Table A.2 (model parameters). The class label y_i in
Equation A.2 is now either -1 (non-BHB) or +1 (BHB), but this
is of no direct concern to the user since in Table A.4 the value of
the product y_i α_i is given.

Evaluate Equation A.2 using the two-class data to obtain the
decision value f for each source. Decision values f > 0 indicate
BHB stars (since the class label for BHB stars is +1) and
decision values f < 0 indicate non-BHB stars.

A.5. Determine the probability of the classification 

If only a classification is required, this step is unnecessary. If a 
probability is also required, apply the probability model to de- 
termine this. 

Given the decision value f from step A.4 above, determine the
probability by evaluating

P(\mathrm{BHB}) = \frac{1}{1 + \exp(A f + B)},    (A.4)

where A and B are parameters determined by cross validation
during training. The values of these for our model are given in
Table A.2.
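A minimal sketch of Equation A.4 follows, using the values of A and B quoted in Table A.2. The function name is hypothetical. Because A is negative, larger decision values map to larger P(BHB), and f = 0 gives a probability very close to 0.5.

```python
import numpy as np

# Probability-model parameters quoted in Table A.2.
A = -1.162976
B = 0.006035218


def svm_probability(f, a=A, b=B):
    """Equation A.4: P(BHB) = 1 / (1 + exp(a * f + b)).

    Accepts a scalar or an array of two-class decision values.
    """
    f = np.asarray(f, dtype=float)
    return 1.0 / (1.0 + np.exp(a * f + b))


# Decision values below, at and above the f = 0 class boundary.
probabilities = svm_probability([-2.0, 0.0, 2.0])
```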

We note again that this is a 'nominal' probability, which assumes
equal class fractions in reality and no change in class fraction
as a function of position, magnitude, etc. A prior should be
introduced to obtain better posterior probabilities, as discussed in
Sect. 3.4. The simple prior used in this paper of P(BHB) = 0.32,
which roughly accounts for the uneven class fractions, is probably
the simplest sensible choice for this.
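One common way to introduce such a prior is sketched below. It assumes, as stated above, that the nominal probability corresponds to equal class fractions, and it replaces that implicit prior of 0.5 using Bayes' rule for two classes. The function name is hypothetical, and this form is an assumption rather than a transcription of the prescription in the main text. It does reproduce the second row of Table 7 (nominal 0.34, prior 0.25, posterior 0.15) to the quoted precision.

```python
def apply_prior(p_nominal, prior):
    """Posterior P(BHB) from a nominal probability that assumed a 0.5 prior.

    Assumed form: p * prior / (p * prior + (1 - p) * (1 - prior)).
    The denominator vanishes only in the degenerate cases
    (p_nominal, prior) = (0, 1) or (1, 0), which are not handled here.
    """
    numerator = p_nominal * prior
    denominator = numerator + (1.0 - p_nominal) * (1.0 - prior)
    return numerator / denominator


# Simple global prior quoted in the text.
posterior = apply_prior(0.34, 0.32)
```

A prior of 0.5 leaves the nominal probability unchanged, and a nominal probability of 0.5 returns the prior itself, which gives two quick sanity checks on any implementation.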
Table A.4. Data for the two-class SVM model.

y_i α_i       u - g       g - r       r - i       i - z
   1024    0.669775    0.534765    0.034976    0.123055
   1024   -1.534446   -1.443351     -0.4881   -0.636521
   1024    0.275784   -0.507595   -0.355955   -0.123561
   1024    0.584588     0.16757   -0.251339   -0.172884
   1024   -0.182098     0.04912   -0.256845    0.192107
   1024   -0.352472   -0.957706   -0.543161   -0.212342
   1024   -0.586737     0.85458    0.657162   -0.291259
   1024    -0.97008    1.340226     0.18364    0.724797
   1024   -0.064965   -0.720806   -0.229315   -0.498417

Notes. Only the first ten lines are shown here for illustration. The
remainder is available in electronic form. There are a total of 2,645
support vectors in the full online table. The columns have the same
meaning as for Table A.3 above. Note that column 1, which gives the
product y_i α_i, always has the same value for the first ten instances.
This is not the case for all the support vectors in the full version.